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Estimating a difference between Kullback-Leibler risks by a 
normalized difference of AIC 



SUMMARY 

AIC is commonly used for model selection but the precise value of AIC 
has no direct interpretation. We are interested in quantifying a difference 
of risks between two models. This may be useful for both an explanatory 
point of view or for prediction, where a simpler model may be preferred if 
it does nearly as well as a more complex model. The difference of risks can 
be interpreted by linking the risks with relative errors in the computation 
of probabilities and looking at the values obtained for simple models. A 
scale of values going from negligible to large is proposed. We propose a 
normalization of a difference of Akaike criteria for estimating the difference 
of expected Kullback-Leibler risks between maximum likelihood estimators of 
the distribution in two different models. The variability of this statistic can 
be estimated. Thus, an interval can be constructed which contains the true 
difference of expected Kullback-Leibler risks with a pre-specified probability. 
A simulation study shows that the method works and it is illustrated on two 
examples. The first is a study of the relationship between body-mass index 
and depression in elderly people. The second is the choice between models 
of HIV dynamics, where one model makes the distinction between activated 
CD4-I- T lymphocytes and the other does not. 

Some key words : Akaike criterion, body-mass index, depression, HIV dy- 
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1 Introduction 

Since its proposal by Akaike (1973), Akaike information criterion (AIC) has 
had a huge impact on so-called "model choice", in particular in the appli- 
cation of statistical methods; see the presentation of deLeuwe (1992). It 
is often used in its original simple form, precisely because of its simplicity. 
Many variants of the criterion have been proposed. We may cite in particular 
the EIC (Konishi and Kitagawa, 1996; Shibata, 1997) which makes use of the 
bootstrap, extended to the choice of semi-parametric estimators by Liquet, 
Sakarovitch and Commenges (2004). Other criteria have been proposed such 
as the BIG (Schwartz, 1978) or approaches based on complexity (Bozdogan, 
2000). AIC is commonly used to select the "best" model on the basis of a 
sample and it is often forgotten that it is a statistic and as such has a dis- 
tribution (see Burnham and Anderson, 2002, and Shimodaira, 2001). When 
the goal is prediction or estimating a parameter which may be common to 
several models, the model averaging approach (Hoeting et al., 1999; Hjort 
and Claesken, 2003; Shen and Huang, 2006) may be used. 

One problem with AIC is that its value has no intrinsic meaning; in par- 
ticular AIC is not invariant to a one-to-one transformation of the random 
variables and values of AIC depend on the number of observations. Investi- 
gators commonly display big numbers, only the last digits of which are used 
to decide which is the smallest. If the specific structure of the models is of 
interest, because it tells us something about the explanation of the observed 



phenomena, it may be interesting to measure how far from the truth each 
model is. This may not be possible but we can quantify the difference of risks 
between two models. It may also be useful in prediction problems where we 
may prefer a simpler model, not only on statistical grounds but because of its 
very simplicity, if the increase of risk incurred by using it is not too large. Of 
course estimating the difference of risks will be informative only if we have 
an idea of what a large or a small difference is. 

We show that a normalized difference of AIC is an estimate of a difference 
of Kullback-Leibler risks. The distribution of this statistic can be estimated 
using the results of Vuong (1989) for non-nested models and results of Wald 
(1943) for the case of nested models. We give some examples of values of such 
differences to help develop an intuition of what a large or a small difference 
is. 

In section 2 we present two examples. One is the comparison of a linear 
and a non-linear effect of body-mass index (BMI) on depression using data 
from the Paquid study; the other is the comparison of two models of inter- 
action between HIV and the immune system. In section 3 we present the 
relevant Kullback-Leibler risk and we show that the normalized difference of 
AIC is an estimate of the difference of risks; moreover we propose a so-called 
"tracking interval" which should contain the difference of risks with a given 
probability; we also give insight in the interpretation of the differences of 
risks. Section 4 presents a simulation study in the framework of the logistic 
regression, which makes it possible to assess the properties of the proposed 
tracking interval. In section 5 we present an illustration on real data in the 
two examples. 



2 Motivating examples 

2.1 Comparison of linear and non-linear effect models 
of BMI on depression 

Our first example bears on the comparison of possible models of association 
of depression and Body-mass index (BMI) in elderly people, using the data of 
the Paquid study (Letenneur et al., 1999). We aim at assessing quantitatively 
the difference between estimators based on different models. 

As is conventional, depression was considered as a binary trait coded by 
a dichotomized version of the CESD (using the thresholds 17 and 23 for men 
and women respectively). The question here is to see whether there is a 
linear effect or if there is an optimal BMI, as far as depression is concerned. 
This problem is treated in the logistic regression framework. The simplicity 
of the problem makes it possible to design a simulation study which looks 
like this real data problem. 

We worked with the sample of the first visit of the Paquid study and we 
excluded the subjects who were diagnosed demented at that visit: the sample 
size was 3484. We fitted logistic regression models for explaining depression 
from BMI, age and gender. We entered age, gender and their interaction as 
explanatory variables. As for BMI which was the factor of main interest, we 
tried a linear (in the logistic scale) model and then we challenged the linear 
model by trying a categorization of BMI in terciles and a quadratic model. 
Specifically it is interesting to see, if there is an effect of BMI, whether there 
is a linear trend or there is an optimal region of values of the BMI (as far 
as depression is concerned). We also tried a more complex model involving 



simple powers of weight and height. 

2.2 Comparison of two models of interaction between 
HIV and the immune system 

Models of the interaction between HIV and the immune system have had 
a high impact on the research in the pathology induced by HIV (Ho et al., 
1995, Perelson et al., 1996). These models are based on ODE systems re- 
flecting the mechanisms of infection of CD4+ T Lymphocytes (called CD4 
for short) and the production of viruses by infected cells. A possible model, 
denoted A^i, is graphically represented in Figure [I] (a); see Appendix for the 
description of the system of ordinary differential equations (ODE). Rather 
than making a patient-by-patient analysis, random effect models (Putter et 
al., 2002) make it possible to analyze a sample of subjects, thus yielding more 
precise estimates of the parameters. The statistical estimation in these mod- 
els is challenging because (i) the ODE systems have no analytical solution; 
(ii) computation of the likelihood involves numerical multiple integrals. 

It may be useful to distinguish between quiescent and activated CD4 
because it seems that only activated CD4 can be infected (De Boer and 
Perelson, 1998). Guedj, Commenges and Thiebaut (2007) analyzed such a 
model, denoted ^A2, represented in Figure [1] (b); see Appendix for details. 
However this model is more complex and therefore numerically more chal- 
lenging. Moreover only the total number of CD4 is measured. So one may 
wonder whether the possible gain obtained with this model is worth the ad- 
ditional complexity. One way to study it is to estimate the difference of 
KuUback-Leibler risks between the two models. Bortz and Nelson (2006) 

6 



used an information complexity criterion and AIC to select between HIV 
dynamics models but could not quantitatively assess the difference between 
models. We will attempt to estimate the difference of Kullback-Leibler risks 
between A^i and A^2 using data of a clinical trial. 

3 Theory about inference of differences of AIC 
criteria 

3.1 Estimating a difference of Kullback-Leibler diver- 
gences 

Consider a sample of independently identically distributed (iid) random vari- 
ables Yn = {Yi,i = 1 . . . ,n) having probability density function (pdf) / = 
/(.). Let us consider two models : (g) = {g^{.))p^B^B C "R^ and {h) = 
(/in.)),er,rc3?^. 

Definition 1 (i) (g) and (h) are non- overlapping if (g) fl (h) = 0; (ii) [g) is 
nested in (h) if (g) C (h); (Hi) (g) is well specified if there is a value P^ E B 
such that g^* = f ; otherwise it is misspecified. 

The log-likelihood loss of g'^ relatively to / for observation Y is log ^^jy)- 
The expectation of this loss under /, or risk, is the Kullback-Leibler diver- 
gence (Kullback, 1968) between / and /: KL(^^, /) = E/[log^^]. We 
have KL{g^, /) > and KL(/, /) = implies that g^ = /, that is f3 = (3^. 
The Kullback-Leibler divergence is often intuitively interpreted as a distance 
between the two pdf (or more generally between the two probability mea- 



sures) but this is not mathematically a distance; in particular the KuUback- 
Leibler divergence is not symmetric. It may be felt that this is a drawback, 
and in particular it makes any graphical representation perilous. However 
this feature may also have a deep meaning in our particular problem: there 
is no symmetry between /, the true pdf, and g^, a possible pdf. So we shall 
take on the fact that the Kullback-Leibler divergence is an expected loss 
(with respect to /) and not a distance. We assume that there is a value 
13q E B which minimizes ¥Aj{g^,f). If the model is well specified (3q = /9^,; 
if the model is misspecified KL{g^°,f) > 0. The MLE /3„ is a consistent 
estimator of Pq. 

We shall say that (g) is closer to / than (h) (avoiding to qualify (g) 
as "better" which may be misleading in this context) if KL(5f'^", /) < KL(/i'''", /). 
We have KL(/,/) = E/[log/(r)] - E/[log^'^(r)]. We cannot estimate 
KL(5f'^",/) because the entropy of /, H{f) = Ef[\og f(Y)], cannot be cor- 
rectly estimated. However, we can estimate the difference of risks A ((7*, h'^°) = 
KL{g^^\ f) — KL(/i'^", /), a quantitative measure of the difference of misspec- 
ification by —n^^(Ll, — L'h'"). 

^ in ^n ^ 

This result may not be completely satisfactory in practice if n is not very 
large because the distribution we will use is g^" rather than (7'^". Thus it 
is more relevant to consider the risk E/[log |^ /, ] that we call the expected 
Kullback-Leibler risk (or simply Kullback-Leibler risk) and that we denote 
by EKL(5f'^", /). This is the point of view introduced by Akaike (1973). 

Akaike's approach was revisited by Linhart and Zucchini (1986) who 
showed that: 

EKL(/", /) = KL(/", /) + ^n-iTr(/; V,) + o{n~^), (1) 



where /, = -E,[^^^M^|,J and J, = E,{f-^^^\a'-^^^Lf}- This 
can be nicely interpreted by saying that the risk EKL{g^" , /) is the sum of the 
misspecification risk KL{g^°, /) plus the statistical risk ^n^^TT{I^^Jg). Note 
in passing that if (g) is well specified we have KL{g^°, /) = and Ig = Jg, 
and thus EKL(/", /) = ^ + o{n~^). 
We also have: 

EKL(/",/) = -Ef{n-'L^l") + H{f) + ^Tr(J;V,) + o,(n-i). (2) 

Here we have essentially estimated E/[log5'*(F)] by Ef[n~^L^ "] but because 
of the overestimation bias, the factor ^ in the last term disappears; thus the 
term ^Tr (J^^^Jg) is the sum of two equal terms, the statistical error and the 
estimation bias of the misspecification risk (of course the misspecification 
risk is estimated up to the constant H{f)). Akaike criterion {AlC{g^") = 
— 2Lj> + 2p) follows from (2) by multiplying by 2n, deleting the constant 
term H{f) replacing Ef{n^^L^") by n~^L|>" and replacing Tr( J~^Jg) by p. 

What we really want to estimate is A{g^",h''") = EKL(/", /)-EKL(/i^", /). 
Using (|2| we obtain: 

E/ {-n-H^f - ^t - [Tr(/;V,) - Tr(/, V.)]}} = A{g^-,h^-)+o,{n-'). 

Using the Akaike approximation Ti{Ig^Jg) k, p, we obtain a simple estimator 

of A(/",/i^"): 

D(/",/,7.) = ^n-i[AIC(/")-AIC(/.>0] = -r^-ML|f"-LK"-(p-g)]. (3) 

Ef[D{g^^,0") - A(/",/i'^")] is an o(n"^). Thus, in contrast with AIC, 
D{g^",h^'^) has an interpretation since its expectation tracks the quantity 



9 



of main interest A{g^",h^") with pretty good accuracy. Moreover it has 
important invariance properties. 

Lemma 1 (Invariance properties) Both A(gf^" , h^") and D{gf^",h^") are 
invariant under re-parametrization, one-to-one transformation of the ob- 
served variables and change of the reference probability. 

The proof is straightforward. It can be noted that AIC itself is invariant 
under re-parametrization but neither under one-to-one transformation of the 
observed variables nor change of the reference probability. 

3.2 Tracking interval for a difference of Kullback-Leibler 
divergences 

We propose a "tracking interval" for A ( (7^" , /i"^" ) . This is not a usual con- 
fidence interval because A{g^",h^") changes with n. Although it converges 
toward A{g^°, /i'^°) we wish to approach A{g^", W") for values of n for which 
the Akaike correction is not negligible. 

We focus on the case where g^° 7^ /i'^". Using Theorem 3.3 of Vuong 
(1989), which is valid under conditions clearly stated by this author, we 
obtain that in that case: 



n'/^[D{g^",h^-)-A{g''",h^-)] ^> Af{0,u;^,), 



(4) 



where cj? = var 
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From this we can compute the tracking interval (A„,5„), where A^ = 
D{g'^",h'^'') - Za/2n~^^^'^n and Bn = D{g'^'\h'^") + ZQ,/2n"^/^a;„, where 1 - 
^{za/2) = olJI and $ is the cdf of the standard normal variable. This interval 
has the property: 

P^K < A(/", W-) < 5„] ^ 1 - a, 

where Pf represents the probability with density /. The assumption (yf^" 7^ 
h"^" is necessarily the case if the models do not overlap and may also be often 
the case even if the models overlap or are nested. However in the latter 
case the convergence toward the normal may be slow and it is desirable to 
construct confidence and tracking intervals compatible with the likelihood 
ratio test. 

3.3 The case of nested models 

In the case of nested models {g) C (h) the likelihood ratio test is often 
used to test whether the true distribution / is in (g). It can be used in 
the more general case where (h) (and hence (g)) is misspecified. In that 
case the null hypothesis Hq that can be tested by the Likelihood ratio test 
is (yf* = h'^°; that is, the closest distribution to / in (h) is in (g). Let us 
define LR = L|> — Lf" . The asymptotic distribution of 2LR under the null 
hypothesis is Chi-square with q — p degrees of freedom. If Ho is true we have 
KL(/o, /) = KL(/iTo, /) and we deduce from Q that A{g^",h"'") ^ ^ < 0. 
Thus if Hq is true the risk of g^" is always lower than that of h^", so we 
should work with (g). 



11 



If however Hq is not true we have KL{h'^°) < KL{g"°) so that 

A(/.,/,7.)>E_^. (5) 

Since ^^ is negative it is possible, if the difference of misspecification risks 
is small enough, that the risk incurred with (g) is smaller than that incurred 
with (h). Also, if Hq is not true, the LR statistic has a completely different 
asymptotic distribution than when Hq is true. This is a normal rather than 
a Chi-square distribution, and even more important, there is a scaling factor 
n~^/^ (see M|), showing that the LR statistic is an Op{n^^'^) and no longer 
an Op{l). A practical question arises: is there a transition between two so 
different distributions ? When Hq is not true but we are not far from it, that 
is \A{h'^°,g'^°)\ is small, the convergence toward the normal may be slow, 
so at finite distance we may be in between the chi-square and the normal. 
In particular we know that D > {p — q)/'n; a normal distribution giving 
non-negligible probability to {D < {p — q)/n} would not be satisfactory. 

Wald (1943), see also Kendall and Stuart (1973), showed that under the 
alternative hypothesis, the likelihood ratio statistic {—2LR) has approxi- 
mately a non-central chi-squared distribution with q — p degrees of freedom 
(dof). We adopt this distribution and express the non-centrality parame- 
ter S in term of A ((?* , /i^'^ ) . We deduce from equations (1) and (3) that 
E[—2LR] ^ 2nA{g^'^ , h'^°) + q — p. Since the expectation of a non-central 
chisquare with dof = q — p is 6 + q — p we obtain 6 ~ 2nA{g^°, /I'^o). For 
A[g^°, h''°) = we retrieve the Xg-p distribution for the classical test of the 
null hypothesis using the likelihood ratio statistic. This distribution is also 
compatible with the asymptotic normal distribution given by Vuong (1989). 
Indeed, for fixed A{g^'\h"'°), we have (5 ^ oo when n — * oo, and we know 
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that the non-central chi-squared distribution tends to a normal when S ^ oo 
(Evans, Hastings and Peacock, 1993). This also entails that, for fixed n, the 
normal approximation will be better for large A ((7'^", h'^°). 

Now suppose that we wish to test "A{g^'-',h'^°) = Aq". We are in the 
ideal situation of simple hypothesis testing where we can apply the Neyman 
Lemma. That is, the rejection region of the test is formed by all the values 
having the lower values of the density of the test statistic. Typically the 
rejection region will be (c, 00) (resp. {cinf,Csup)) for small (resp. large) 
values of Aq. The test can be inverted to form a confidence interval for 
A[g^'^ , W"): the 1 — a confidence interval is formed of all the values Aq 
which are not rejected by the test at level a. This confidence interval is by 
definition compatible with the likelihood ratio test, since will not be in the 
interval if "A((7*, /i'^") = 0" has been rejected by the test (which precisely 
assumes a Xg-p distribution for Aq = 0). From this confidence interval for 
A[g^°, h'^'^), say (A'„, B'^), we can deduce the tracking interval for A{g^", h'^") 
by subtracting to the bounds the additional statistical risk incurred with 
(h), that is (g - p)/2n: A^ = A[^ + {j> - (iM'^n; 5„ = B'^ + {p - q)/2n. It 
is not impossible that A„ be negative, even if ^^A{g^",h'^°) = 0" has been 
rejected. Indeed, if we reject Hq using the likelihood ratio test, we reject 
A[g^", W") = ^^ but we do not reject negative values of A{g^", W") larger 
than 2=2. 

2n 

In practice, the computation of the intervals may be done by comput- 
ing the p- value for each value Aq. Let /aj, and Faj, be the pdf and cdf of 
the non-central chi-squared distribution with q — p dof and non-centrality 
parameter 2riAo. If /ao(^) > /Ao(^2Li?) for all x < —2LR, the p-value is 
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simply 1 — -Fao(— 2Li?). This situation occurs for small values of dof and non- 
centrality parameter. If this is not the case the rejection region includes an 
interval (0, Ci„/) so the p-value is l-FAo(-2Li?)+FAo(ci„/) where /ao (cm/) = 
/ao(^2L_R). In practice it may not be easy to find Cj„/ unless a special pro- 
gram is available. We propose to look at the quantile of (1 — -Fa(,(— 2Li?))/2, 
say qpv/2- If fAo{(lpv/2) > /Ao(^2Li?) we can take p-value= 1 — Fao(— 2L_R); 
if fAo{Qpv/2) < /Ao(-2Li?) we take p-value= 2(1 - F^^^{-2LR)). 

3.4 How to interpret a difference of Kullback-Leibler 
risks 

It is important to judge whether the values within the intervals correspond to 
large or small expected losses. The Kullback-Leibler risk takes values between 
and +00 but in practice most of the risks or difference of risks that we 
encounter are lower than 1. To give an idea of how to interpret these values 
we may relate them to relative errors made in evaluation of probabilities 
as in Commenges et al. (2007). We will make errors by evaluating the 
probability of an event A using a distribution g, Pg{A)^ rather than using 
the true distribution /, Pf{A). For instance we may evaluate the relative 
error re{Pg{A)^Pf{A)) = J tj^) ■ Consider the typical event on which 
Pf{A) will be under-evaluated defined as: A = {x : g{x) < f{x)}. To 
obtain a simple formula relating KL{g, f) to the error on Pf{A) we consider 
the particular case Pf{A) = 1/2 and g/f constant on A and A^ . In that 



case we easily find: re{Pg{A),Pf{A)) = ./i^^F^klUJ) ^ ^2KL{g,f), the 
approximation being valid for small KL value. For KL values of 10^^, 10^^, 
10"^ 10"^ we find that re{Pg{A),Pf{A)) is equal to 0.014, 0.045, 0.14 and 
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0.44, errors that we may qualify as "negligible", "small" , "moderate" and 
"large" respectively. 

As already noted we can give an interpretation of EKL from ([I]) as the sum 
of the misspecification risk KL{g^°, /) and the estimation risk, approximated 
by p/2n. For a well specified model the risk is about p/2n; for instance it 
is 10~^ if p = 10 and n = 500, or if p = 1 and n = 50. The statistical risk 
associated to the estimation of one parameter is negligible, small, moderate 
and large for n = 5000, 500, 50, 5 respectively. The correspondence between 
the different scales is summarized in Table 1. We may also measure on this 
scale the magnitude of the Akaike correction of {p — q)/n. 

As an example the KL divergence of a double exponential relative to a 
normal distribution with same mean and variance is of order 10^^ what may 
be called a "large" value. As another example we may compute the risk 
incurred when using a normal distribution of variance a"^ when the true dis- 
tribution has variance one. It is easy to compute that the Kullback-Leibler 
risk is ^[loga^ — 1 + ;^]: this expression takes the value for a^ = 1 and 
tends toward +oo if o"^ tends toward +oo or 0. The values obtained for 
a^ = 1.02; 1.1; 1.3; 2 are respectively = 0.0001; 0.002; 0.016; 0.096 correspond- 
ing approximately to the the negligible, small, moderate and large levels. To 
approach a risk of 1, one has to take very large values of cr^: the risk is 0.65 for 
(T^ = 4 and 0.91 for a^ = 16. Finally we give the correspondence between the 
KL divergence and the odds-ratio in a particular case of a binary variable with 
Pf{Y = 1|X) = 1/2, while \ogit[Pg{Y = 1\X)] = /3X, X being itself a binary 
variable taking values 1 or —1 with probability 1/2. We have KL{g, f) = 
E{l/21og[ p (yLi\x) "*" l/21og[ p (yLoix) ^' where the expectation bears on X. 
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After some algebra we find that KL{g, f) = l/21og[l/2(l + cosh.{f3))]. Tlie 
values of the odds-ratio (0R= e^) giving negligible, small, moderate and 
large divergences are 1.03; 1.1; 1.35; 2.5 respectively. It is important to re- 
alize that this correspondence depends on the joint distribution of both Y 
and X; higher values of OR are associated to the same divergence levels for 
P/(r = 1|X) ^ 1/2 or P(X = 1)^1/2. 

A question which arises is whether the Kullback-Leibler risks are com- 
parable when Y is multivariate and when Y is univariate. If we have n 
independent univariate variables and we group them in vectors of size m, 
we obtain n' = n/m multivariate observations. To get the same estimator 
of the difference of risks between two models we should divide by the n'm 
rather than by n'. Thus in case of multivariate data we propose to divide the 
difference of AIC by the total number of measurements to get a value that 
is more comparable to situation where the variables are univariate. 

3.5 Extension to regression models 

All that has been said can be extended to regression models {gY\x) = (fi'Y|x(-l-))/3eB 
and (/iy|x) = (^y|x(-l-))7er- This can be done as in Vuong (1989) by directly 
defining the Kullback-Leibler divergence in term of conditional densities: 
KL(5fy.|^, /y|x) = Ej[log ^'^ ], where the expectation is taken for the 

^Y\X ^ I ' 

true distribution of the couple Y,X. However this approach has the draw- 
back of requiring a new definition of the Kullback-Leibler divergence . The 
so-called reduced model approach (Commenges et al., 2007) is more satisfac- 
tory. Consider a sample of iid couples of variables {Yi, Xi),i = 1, . . . , n having 
joint pdf/, f{y,x) = fY\xiy\x)fxix). Consider the model (g) = (g^i-, .))f3eB 
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such that g^{y,x) = gY\-^{y\x)fx{x) ; the model is called "reduced" because 
fx{-) is assumed known. The KuUback-Leibler divergence is: 

KL(/,/) = E;[log/y|x(r|X)]-E^[log(7f|^(r,X)], 

that is the term in fx{-) disappears (so that we do not need to know it 
in fact) and we get the same definition as in Vuong (1989) using only the 
conventional KuUback-Leibler divergence . 

4 Simulation study 

4.1 Study of the tracking interval in a non-nested case 

We performed a simulation resembling the situation of the Depression-BMI 
application where we have to choose between different logistic regression 
models. We considered iid samples of size n of triples (1^, x\, x\),i = 1, . . . ,n 
from the following distribution (which plays the role of the true distribu- 
tion /). The conditional distribution of Yi given {x\,x\) was logistic with 
logit[/y|x(l|a;i,4)] = 0.5+x\+2xi, where /Y|x(l|a;i,x*2) = Pf{Yi = l\x\,xi); 
the marginal distributions of {x\,xD were bivariate normal with zero ex- 
pectation and variance equal to the identity matrix. We considered model 
(g) specified by logit[5'y|j(^(l|a;*i,X2)] = /3o + j3ix\ + I32x\, which was well 
specified and the (mis)specified model (h) defined as logit[/iy|;sc(l|a^i,a^2)] = 
7o + Ya=i li^ii + 732^2; where x\i were dummy variables indicating in which 
categories x\ fell; the categories were defined using terciles of the observed 
distribution of xi, and this was represented by two dummy variables: x\i 
indicating whether x\ fell in the first tercile or not, x\2 indicating whether 
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x\ fell in the second tercile or not. 

Since model (g) is well specified we know that g^° = /, that the misspeci- 
fication error KL{g^°, /) is zero and that TT{Ig^Jg) = p. As for model (h) we 
must compute the quantities of interest by simulation. We can compute that 
in the logistic regression the /, k term of the matrix Jh is E/ [xi{Y— ^1^x-,q ) ^a^fc] , 
and that the /, k term of the matrix Ih is Ef[xijj^^^^Xk]- We estimated 70 
by fitting model (h) on a simulated data set with n = 10^. Our precise esti- 
mate 7o was thus % for n = 10^. We used it to precisely estimate Jh and Ih 
as ih = lO-'El'JAxl^^^^xl] and J, = 10"^ E!£iN(>^. - I^)M]- 

We estimated KLf/i^o, /) by lO^^ EJ-i log -^^[^(^'l^i'^^) ^ j computed 

a precise estimate of u;^, tu^, by the empirical variance of log 4^^— fpr^^ 
computed on 10^ replicas. Thus we can compute a precise estimate of 
EKL{h'^",f) and EKL{g^",f) by replacing the terms on right-hand of (1) 
by their estimates. Because {g) is well specified we obtain immediately 
EKL(t/^", /) ^ 1^; a precise estimate of EKL(/", /) - EKL(/i^", /) is thus 
given by A = 1^ - KL(/i^", /) - ^Tr(/^ V,,). We find first that KL(/i^«, /) ^ 
7.28 10^'^, a value approaching the "moderate magnitude". We found 3.998 
and 3.999 for the values of Ti{I^^Jh) for n = 250 and n = 1000 respec- 
tively. These values are very close to g = 4 (that would obtain if (h) 
was well-specified) so, in the following we will use this approximation. Us- 
ing this approximation we can compute A = — ^ — KL(/i'^'',/) and obtain 
A = -9.28 10-3 for n = 250 and A = -7.78 10"^ for n = 1000. We also find 
o)^ = 1.44 10~^. We can then compute the standard error of D as n^^^^ci)* 
and find 7.59 10"^ and 3.79 10"'^ for n = 250 and n = 1000 respectively. 
We see at once that there is more chance that the tracking interval does not 



contain zero for n = 1000 than for n = 250. 

We generated 1000 replications from the above model for n = 250 and 
n = 1000. For each replication we computed the maximum likelihood esti- 
mates and the AIC We computed the histogram of D{g^", 0") (see Figure 
2): its shape is approximately in accordance with the asymptotic normal 
distribution for both sample sizes; the empirical mean was —9.50 10^^ and 
—7.67 10^'^ for n = 250 and n = 1000 respectively, close to the values of A. 
The empirical variance of D (not shown) was in agreement with the theo- 
retical variance computed from ci)^. The mean of the estimated variances uj'^ 
was 1.88 10~^ and 1.54 10"^ for n = 250 and n = 1000 respectively, also 
reasonably close to the ul. The proportion of replicas for which A was out- 
side the .95 tracking interval was 0.045 and 0.053 for n = 250 and n = 1000 
respectively. The proportion of replicas for which zero was outside of the 
tracking interval was 0.197 and 0.514 for n = 250 and n = 1000 respectively, 
and in all cases (g) was preferred to (h). These results are summarized in 
Table 2. 

The results of the simulation are in accordance with the asymptotic the- 
ory. From a practical point of view, the variability of D seems to be large 
so that it is difficult to be sure that an estimator is better than another one 
if the difference of risk is small or moderate. Note that this variability is 
not specific to our approach but is a fact applying to any criteria based on 
likelihood ratio. For instance in the simulated situation for n = 250 there 
is a probability of about 12% that D{g^",h^'") takes a positive value (thus 
suggesting the wrong choice) and this probability is exactly the same for 
AIC. 
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4.2 Quality of the fit by the non-central chi-squared 
distribution in the nested case 

We performed another simulation for the case of nested model, to check the 
quality of the approximation of the distribution of —2LR by the non-central 
chi-squared distribution. We made two simulations with true distributions 
/^, specified by : logit[/y|;sf (l|a;i,a;2)] = 0.5 + Q.2x\ + 2x1 ^^^ P^ specified 
by: logit[/y|jf (l|x*]^,a;2)]) = 0.5 -|- Q.'ox\ + 2x\. For both cases we consid- 
ered two models: {g) and {h) with logit[(y'y-|^(l|x5^,X2)] = /9o + 132^2 ^^<i 
logit[/iy|j^(l|a;^,a;2)] = 7o + lix\ + 722:2; so that {h) was well specified while 
((?) C {h) was misspecified. However if /^ is the true distribution the differ- 
ence of risks using {g) and {h) is of "small" magnitude (^ 10^^) while if p 
is the true distribution it of "moderate" (~ 10^^) magnitude. The distribu- 
tions of {x\,X2) were as in the first simulation above. We simulated 10000 
replications of samples of size n = 1000 from /^ and /^ and in both cases 
we studied the fit of the non-central chi-squared distribution for the distri- 
bution of —2LR. The dof was equal to 1 and we took the expectation equal 
to the mean, from which we deduced the non-centrality parameter. Figure 
3 displays the histograms and the non-central chi-squared densities for both 
cases. The fits are nearly perfect and we also see that the distribution is 
closer to the normal for /^ than for /^. It is clear that the convergence to 
the normal is slow in the case of nested models unless the difference of risks 
is large. 
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5 Applications 

5.1 Relation between BMI and depression: analysis of 
the Paquid data 

The values of AIC, and the D statistic and tracking intervals (taking as 
reference the linear model) are given in Table 3. The tercile model had a 
larger AIC than the linear model but the point estimate (D) of the difference 
of risks was lower than 10~^ a level that we have qualified "negligible", and 
zero was well inside the tracking interval. So from the point of view of 
Kullback-Leibler risk there was no evidence that one model is better than 
the other. When it comes to comparing the linear and the quadratic model, 
because the first is nested in the second, we can use the likelihood ratio test: 
the null hypothesis is that the best distribution is in the linear sub-model. 
The hypothesis was strongly rejected {p < 0.01). We tend to conclude that 
the shape of the effect is not linear and that we may approach it better with 
a quadratic term. However it is interesting to estimate the difference of risks 
between the two models. The point estimate of the difference of risks was 
0.0007, a value which approaches the 10~^ level that we qualified to be a small 
(but not negligible) difference. Since {g) C {h) we computed the tracking 
interval applying the version of the tracking interval for nested models of 
section 3.3. The computation was done using using the pchisq, dchisq and 
qchisq R functions. We found (0.00012; 0.0030) for the confidence interval 
of A(5f*,/i'^") and, subtracting the increased statistical risk {p — q)/2n = 
0.00014, we found (-0.00002; 0.0029) for the tracking interval. Thus we 
are not completely sure to incur a smaller risk with the quadratic model. 
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However, if the difference of risks was not in favor of the quadratic model, 
this would be completely negligible. The difference of risks in favor of the 
quadratic model may be negligible or of small magnitude. 

In conclusion there is no reason to prefer the tercile model to the linear 
model but there are some reasons to prefer the quadratic model to the linear 
model. Figure 4 shows the shape of the effect of BMl with the quadratic 
model, taking as reference the median BMI (equal to 24.2). This is a U- 
shaped curve yielding the lower risks of depression for medium values of the 
BMl, somewhat shifted however toward large BMl. Of course the epidemi- 
ological interpretation of this result is delicate and the apparent effect that 
we have detected is the consequence of complex biological and psychological 
mechanisms that we do not attempt to explore here. Several other studies 
have found links between BMI and depression (Bergdahl et al., 2007; Bjerke- 
set et al., 2008). 

Since BMI is a combination of weight and height one may wonder whether 
it is possible to find a better model directly using simple powers of height and 
weight in the linear predictor. It happens that the model including weight, 
height, weight^, height^ and 1/height, that we denote (w) = {w^)e,=e, has a 
better AIC than the quadratic (in BMI) model, (h). Note that (h) is not 
nested in (w). Following the conventional use of AIC we should prefer (w) 
to (h). However (w) lacks readability because it involves a combination of 
weight and height that has never been used. For instance a nice graphical 
representation of the effect of weight and height such as presented in Figure 
4 is not possible. So we have non-statistical reasons to prefer (h) over (w). If 
we examine the statistical reasons to prefer {w) over {h) they are very thin. 
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First, the point estimate of A{h^",w^") is D = 0.0003, of the "neghgible" 
order of magnitude. Second, the tracking interval is [—0.0016; 0.0022]: zero 
is well inside this interval, so there is no confidence that we incur a lower 
risk using (w) rather than {h). Thus it is reasonable to prefer {h) for further 
use, for instance presentation of the epidemiological evidence of a relation 
between over- and under-weight and depression. 

5.2 Interaction between HIV and the immune system: 
analysis of the ALBI data 

As an application of the proposed method, we analyzed the difference of risks 
between the model A^i and model A^2 described in section |2.2 using the 



data of a randomized clinical trial, the ALBI ANRS 070 trial (Molina et al., 
1999). This trial compared over 24 weeks the combination of zidovudine plus 
lamivudine (AZT+3TC) to that of stavudine plus didanosine (ddI+d4T). 
There were 50 patients in each arm. Measurements of CD4 and of HIV 
RNA were taken once a month up to six months. The likelihood, taking into 
account the detection limit of HIV RNA, was computed with the algorithm 
of Guedj, Thiebaut and Commenges (2007). The AIC for model Aii was 
equal to 1466.15 while for model A^2 AIC = 1026.63. The estimate of the 
variance was o;^ = 5.88. Thus the D statistic was equal to 4.40. However this 
applies to a multivariate outcome: we had seven measurements of viral load 
and of CD4 counts for each subject, that is 14 measurements per subjects. 
So the standardized value of D was 4.40/14 = 0.31. For the tracking interval 
we find [0.28; 0.35]. 

We can say with a good degree of confidence that the difference of risks 
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is larger than 0.28, a large difference as we have seen. This means that this 
difference between quiescent and activated CD4 is an important biological 
fact and that it must be taken into account, even though fitting the more 
complicated model is more challenging. 

6 Discussion 

We have proposed a statistic which tracks the difference of expected Kullback- 
Leibler risks between maximum likelihood estimators in two different models, 
A{g^", h'^"). Moreover we have an estimator of the variance of this statistic 
and we can construct a "tracking interval". We can also construct a con- 
fidence interval for A{g^'\ h""-''): the bounds of the latter are the bounds of 
the former shifted of {q — p)/2n. The results of our simulation study were 
in agreement with the asymptotic results. Our approach enlightens the un- 
avoidable variability of any criterion based on log-likelihood ratio such as 
AIC, BIG and their variants. This variability is generally not taken into ac- 
count and there is a misleading intuition that extrapolates the distribution of 
the likelihood ratio test to the variability of AIC. The distribution of the like- 
lihood ratio statistic is well approximated by a normal in the non-nested case 
while it is better approximated by a non-central chi-squared in the nested 
case. In both cases the variance is larger than that of the chi-squared with 
q — p dof, a distribution which holds only under the null hypothesis of the 
likelihood ratio test. 

In fine we can do more than simply choosing the estimator which has the 
lowest AIC. We can estimate the difference of risks and this has the same 
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meaning in different problems. We may become accustomed to considering 
differences of 10~^, 10~^, 10~^, 10~^ as negligible, small, moderate and large 
respectively, as we are accustomed to interpret correlation coefficients or 
odds-ratios for instance. More work is needed however to deepen our intuition 
about the magnitude of a difference of Kullback-Leibler risks. 

In the first application we have found that the quadratic model for the ef- 
fect of BMI on risk of depression was better than a linear model, although the 
difference between the two models was small. With the quadratic model both 
low and high BMI are at higher risk of depression. Our method gives argu- 
ments to prefer the quadratic model in BMI for presentation of the results to 
a more complex model obtaining a slightly better AIC In the application on 
comparing two HIV dynamics models, we found that the model distinguish- 
ing quiescent and activated CD4 was better than the simpler model which 
did not make this distinction. The estimated difference of risks was large 
and this has implications in future developments of HIV dynamics models. 

The statistic D and the tracking interval for the difference of risks are 
easy to compute and could be useful in a wide variety of apphcations. 
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Appendix: The HIV dynamics models 

To write the differential equation for the model, one uses assumptions which 
are plausible in view of the knowledge of the biological mechanisms: for 
instance we assume that new CD4 are produced (by the thymus) at a rate 
A, that only activated cells can be infected, that the probability of meeting 
of a cell and a virion is proportional to the product of their concentrations. 
A possible model (A^i) takes into account the uninfected and infected CD4, 
T and T* respectively, and the viral particles, V and is as follows: 

dft = {X-{l-r]l''^hTtVt-i2tft)dt 
dT; = [{l-r]l''^)^TtVt~i2T^T:]dt 
dVt = {fiT^'TfTt - fivVt)dt, 

where I^^ is indicates whether a treatment based on an inhibitor of the 

reverse transcriptase. 

Another model (7W2) distinguishes between quiescent (Q) and activated (T) 

CD4: 

29 



dQt = (A + pTt - aQt - fJ^QQt)dt 

dTt = {aQt-il-vl''^hTtVt-pTt-iXTTt)dt 

dT: = [{l-r]l''^)-fTtVt-fiT.T:]dt 

dVt = {fiT;TTT^ - pyVt)dt 



A statistical model is necessary to take into account that some parameters 
may differ from one subject to another and to link the observations to the 
ODE system. In model A^i the parameters A and n were random (adding 
other random parameters did not increase the likelihood). In model A^2 the 
parameters a, A and fiT* were considered as random. Measurements of the 
total numbers of CD4 and of number of viruses were available at times tij. 
We assumed the following observation equations: 



Yiji = logioiVi{tij,^ ) + VNiitij,^ )) + eiji, j < rii 

An additional complexity was that HIV RNA load was measured up to a 
detection limit. Guedj, Thiebaut and Commenges (2007) designed a special 
algorithm for computing and maximizing likelihood for this type of models. 
We refer the reader to this paper for more details. 
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Table 1: Order of magnitude of KL risks; the relative error is that for a 
typical underestimated event in a standard case; the sample size is the size 
which gives the corresponding statistical risk for estimating one parameter. 



Qualification 


KL scale 


Relative error 


Risk for estimation of one parameter 
Sample size 


Large 


10-1 


0.44 


5 


Moderate 


10-2 


0.14 


50 


Small 


10-3 


0.045 


500 


Negligible 


10-^ 


0.014 


5000 



Table 2: Simulation study: choice between tercile and linear model for the 
explanatory variable in a logistic regression model. 



n 


A 


D Q? 


Coverage rate 


Power 


250 
1000 


-9.28 10-3 
-7.78 10-3 


-9.50 10-3 1.88 10-2 
-7.67 10-3 1.54 10-2 


0.967 
0.954 


0.197 
0.514 
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Table 3: Upper part of the table: comparison of the linear, tercile and 
quadratic models for the effect of BMI on depression: D and the track- 
ing interval are with respect to the linear model. Lower part: comparison of 
the quadratic model with the model {w) including weight, height, weight^, 
height^ and 1 /height: D and the tracking interval are with respect to the 
quadratic model. 



Model 7^ parameters Likelihood AIC D Tracking interval 



Linear 


5 


Tercile 


6 


quadratic 


6 



-1346.2 2702.5 

-1345.6 2703.2 -0.0001 [-0.0009; 0.0007] 

-1342.9 2697.9 0.0007 [-2.10-^0.0029] 



quadratic 6 -1342.9 2697.9 

H 9 -1338.7 2695.5 0.0003 [-0.0016; 0.0022] 
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Figure 1: Graphical representation of HIV dynamics models: (a) model A^i 
including uninfected (T) and infected (T*) CD4+ T lymphocytes, and HIV 
viruses {V); (b) model A^2 including uninfected quiescent (Q), uninfected 
activated (T), infected (T*) CD4+ T lymphocytes, and HIV viruses {V). 
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Figure 2: Histogram of the values of D (which estimates the difference of 
Kullback-Leibler risks between the tercile and the hnear models) in the sim- 
ulation: upper figure, n = 250, lower figure, n = 1000. 
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Figure 3: Fit of the distribution of —2LR in the case of nested models, 
{g) C {h) (see section 4.2), by the non-central chi-squared distribution with 
q — p dof: (a) case of a "small" difference of risks (true distribution /^); (b) 
case of "moderate" difference of risks (true distribution p). 
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Figure 4: Estimated "effect" of the BMI on depression in the quadratic 
model: odds-ratios with respect to the probabihty at the median of BMI 
(24.2); the dots have for abscissas the observed BMI values. 
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